Arabic speech recognition by end-to-end, modular systems and human
نویسندگان
چکیده
Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers debate if the machine has reached performance. Previous work focused on English language and modular hidden Markov model-deep neural network (HMM–DNN) systems. In this paper, we perform a comprehensive benchmarking for end-to-end transformer ASR, HMM–DNN (HSR) Arabic its dialects. For HSR, evaluate linguist performance lay-native speaker new dataset collected as part of study. ASR 12.5%, 27.5% , 33.8% WER; milestone MGB2, MGB3, MGB5 challenges respectively. Our results suggest that is still considerably better than with an absolute WER gap 3.5% average.
منابع مشابه
Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training. Otherwise, the networks may fail to find a good local optimum. This is particularly true for low-latency online networks, such as unidirectional LSTMs. Currently, the best strategy to train such systems is to bootstrap the training from a tied-triphone system. However, this is...
متن کاملEnd-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-toend audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
متن کاملEnd-to-End Speech Recognition Models
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems have been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and there especially exists a disconnect betw...
متن کاملMultichannel End-to-end Speech Recognition
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend ...
متن کاملTowards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Computer Speech & Language
سال: 2022
ISSN: ['1095-8363', '0885-2308']
DOI: https://doi.org/10.1016/j.csl.2021.101272